MB-71397: Fix CPU OOM during GPU Train#76

Open
CascadingRadium wants to merge 1 commit into fixSize from fixOOM2
Conversation

@CascadingRadium
Member

  • Training large datasets with high-dimensional vectors on GPU caused an out-of-memory
    crash due to allocating residuals for the full training set.
  • The GPU scalar quantizer encoder training now subsamples the input to a bounded number
    of vectors, mirroring the existing CPU behaviour.
  • The encoder training vector limit is propagated correctly when cloning a CPU index to GPU.

Co-authored-by: Copilot <copilot@github.com>

Copilot AI left a comment

Pull request overview

This PR addresses CPU out-of-memory crashes during GPU training of IVF+ScalarQuantizer indexes by introducing training-time subsampling (to avoid allocating full-size residual buffers) and by propagating the encoder-training vector limit when cloning a CPU IVF index to GPU.

Changes:

  • Add subsampling in GpuIndexIVFScalarQuantizer::trainResiduals_ using fvecs_maybe_subsample.
  • Introduce GpuIndexIVF::train_encoder_num_vectors() and store an encoder-training vector limit in GpuIndexIVF.
  • Propagate IndexIVF::train_encoder_num_vectors() from CPU to GPU in GpuIndexIVF::copyFrom.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Files reviewed:

  • faiss/gpu/GpuIndexIVFScalarQuantizer.cu: Subsamples training vectors before residual computation to reduce CPU memory usage during GPU training.
  • faiss/gpu/GpuIndexIVF.h: Adds a virtual encoder-training vector limit accessor and stores the propagated limit.
  • faiss/gpu/GpuIndexIVF.cu: Copies the CPU encoder-training vector limit into GPU state and exposes it via a new accessor.

Comment thread on faiss/gpu/GpuIndexIVFScalarQuantizer.cu
@CascadingRadium
Member Author

When training on the CPU index, we hit this code:

void IndexIVF::train(idx_t n, const float* x) {
    if (verbose) {
        printf("Training level-1 quantizer\n");
    }

    // Train Quantizer
    train_q1(n, x, verbose, metric_type);

    if (verbose) {
        printf("Training IVF residual\n");
    }

    // optional subsampling
    idx_t max_nt = train_encoder_num_vectors();
    if (max_nt <= 0) {
        max_nt = (size_t)1 << 35;
    }

    // Train Residuals
    TransformedVectors tv(
            x, fvecs_maybe_subsample(d, (size_t*)&n, max_nt, x, verbose));

    if (by_residual) {
        std::vector<idx_t> assign(n);
        quantizer->assign(n, tv.x, assign.data());

        std::vector<float> residuals(n * d); // <--- OOM LINE
        quantizer->compute_residual_n(n, tv.x, residuals.data(), assign.data());

        train_encoder(n, residuals.data(), assign.data());
    } else {
        train_encoder(n, tv.x, nullptr);
    }

    is_trained = true;
}

We basically:

  • Train the quantizer
  • Train the encoder/residuals

On the GPU side, the input is not subsampled, so we hit the equivalent of the OOM line in the GPU code path: a residual buffer of n * d floats is allocated for the full training set, which is unbounded.

This patch fixes that by mimicking the CPU subsampling behaviour for the residual vectors.

Member

@Thejas-bhat left a comment

looks good, but can you add the MB associated with this?

@CascadingRadium CascadingRadium changed the title Fix CPU OOM during GPU Train MB-71397: Fix CPU OOM during GPU Train May 14, 2026